Goto

Collaborating Authors

 target function


Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

arXiv.org Machine Learning

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent on the first layer weights in this regime, such one-step update schemes are fundamentally limited: the update to the weights is approximately rank-one, captures only a single direction, and requires the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the \textit{second step} of gradient descent with step-sizes $ฮท_1\asymp N^{ฮฑ_1}$ and $ฮท_2 \asymp N^{ฮฑ_2}$ for $ฮฑ_1, ฮฑ_2 \in [0,0.5)$, where $N$ is the number of hidden neurons. We derive a spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of the outliers is determined by the parameters $ฮฑ_1, ฮฑ_2$ through $\lfloor \frac{ฮฑ_2}{1/2 - ฮฑ_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, provided that $ฮฑ_1, ฮฑ_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, our work proposes a tractable framework for studying optimization and feature learning phenomenology in modern overparameterized networks.



Power and limitations of single-qubit native quantum neural networks

Neural Information Processing Systems

Quantum neural networks (QNNs) have emerged as a leading strategy to establish applications in machine learning, chemistry, and optimization. While the applications of QNN have been widely investigated, its theoretical foundation remains less understood. In this paper, we formulate a theoretical framework for the expressive ability of data re-uploading quantum neural networks that consist of interleaved encoding circuit blocks and trainable circuit blocks. First, we prove that single-qubit quantum neural networks can approximate any univariate function by mapping the model to a partial Fourier series. We in particular establish the exact correlations between the parameters of the trainable gates and the Fourier coefficients, resolving an open problem on the universal approximation property of QNN. Second, we discuss the limitations of single-qubit native QNNs on approximating multivariate functions by analyzing the frequency spectrum and the flexibility of Fourier coefficients. We further demonstrate the expressivity and limitations of single-qubit native QNNs via numerical experiments. We believe these results would improve our understanding of QNNs and provide a helpful guideline for designing powerful QNNs for machine learning tasks.




Locality defeats the curse of dimensionality in convolutional teacher-student scenarios

Neural Information Processing Systems

Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher-student framework for kernel regression, using'convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using heuristic methods from physics, we find in the ridgeless case that locality is key in determining the learning curve exponent ฮฒ (that relates the test error t P ฮฒ to the size of the training set P), whereas translational invariance is not. In particular, if the filter size of the teacher tis smaller than that of the student s, ฮฒ is a function of s only and does not depend on the input dimension. We confirm our predictions on ฮฒ empirically. We conclude by proving, under a natural universality assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.


4b5deb9a14d66ab0acc3b8a2360cde7c-Paper.pdf

Neural Information Processing Systems

For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization, but for the networks used in practice, the empirical NTK only provides a rough first-order approximation. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such approximation by conducting a systematic comparison of the behavior of different neural networks and their linear approximations on different tasks. We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances. However, in contrast to what was previously reported, we discover that neural networks do not always perform better than their kernel approximations, and reveal that the performance gap heavily depends on architecture, dataset size and training task. We discover that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus, revealing a new type of implicit bias.